
Conversation

@nabinchha (Contributor) commented Dec 2, 2025:

Opening a draft PR for initial feedback. I've currently extended the modality to support embedding generation, but we can follow the same pattern to support image generation.

The major change is the need to break InferenceParameters out into generation-type-specific classes. Changes include renaming the existing InferenceParameters to CompletionInferenceParameters, with backwards compatibility and a deprecation warning.
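
One common shape for that backwards-compatible rename is a module-level alias that warns on access (a sketch, assuming the new class lives in the same module; the actual shim in the PR may differ):

import warnings

# Hypothetical module-level shim (PEP 562): importing the old name still
# works but emits a DeprecationWarning pointing at the new class.
def __getattr__(name: str):
    if name == "InferenceParameters":
        warnings.warn(
            "InferenceParameters is deprecated; use CompletionInferenceParameters instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return CompletionInferenceParameters
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")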

I'm working on expanding to image generation in the same PR and will mark this as ready for review when that's done.

Here's an example of what the workflow looks like for embeddings:

import json
import pandas as pd
from data_designer.essentials import (
    DataDesigner,
    DataDesignerConfigBuilder,
    EmbeddingColumnConfig,
    EmbeddingInferenceParameters,
    ExpressionColumnConfig,
    GenerationType,
    ModelConfig,
)

model_configs = [
    ModelConfig(
        alias="nvidia-embedder",
        model="nvdev/nvidia/llama-3.2-nv-embedqa-1b-v2",
        provider="nvidia",
        generation_type=GenerationType.EMBEDDING,
        inference_parameters=EmbeddingInferenceParameters(
            extra_body={"input_type": "query"},
        ),
    ),
    ModelConfig(
        alias="openai-embedder",
        model="text-embedding-3-small",
        provider="openai",
        inference_parameters=EmbeddingInferenceParameters(
            dimensions=768,
            encoding_format="float",
        ),
    ),
]

config_builder = DataDesignerConfigBuilder(model_configs=model_configs)

with open("dummy_generated_data.json", "r") as f:
    full_generation_data = json.load(f)

config_builder.with_seed_dataset(
    dataset_reference=DataDesigner.make_seed_reference_from_dataframe(
        pd.DataFrame(full_generation_data),
        "tmp_dedup.json"
    ),
    sampling_strategy="ordered"
)

config_builder.add_column(
    ExpressionColumnConfig(
        name="questions",
        expr="{% for pair in qa_generation.pairs %}{{ pair.question }}\n{% endfor %}"
    )
)

config_builder.add_column(
    EmbeddingColumnConfig(
        name="embedding_nvidia",
        model_alias="nvidia-embedder",
        target_column="questions",
        chunk_pattern="\n+"
    )
)

config_builder.add_column(
    EmbeddingColumnConfig(
        name="embedding_openai",
        model_alias="openai-embedder",
        target_column="questions",
        chunk_pattern="\n+"
    )
)

data_designer = DataDesigner()
result = data_designer.preview(config_builder)
result.display_sample_record()

if response.data and len(response.data) == len(input_texts):
    return [data["embedding"] for data in response.data]
else:
    raise ValueError(f"Expected {len(input_texts)} embeddings, but received {len(response.data)}")
A reviewer (Contributor) commented on the snippet above:

There might be an issue if response.data is None?

@nabinchha (Contributor, Author) replied Dec 2, 2025:

Based on the documentation, upon calling .embedding(...) we're either going to get an EmbeddingResponse object or some exception will be raised. And EmbeddingResponse.data is a list... the latter check should kick in if that list is empty, right?
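
For reference, a more defensive variant of that check could look like this (a sketch; the helper name is made up):

def _extract_embeddings(response, input_texts: list[str]) -> list[list[float]]:
    # Guard the None/empty case first, so the error message itself
    # can never fail on len(None).
    if not response.data:
        raise ValueError(f"Expected {len(input_texts)} embeddings, but received none")
    if len(response.data) != len(input_texts):
        raise ValueError(f"Expected {len(input_texts)} embeddings, but received {len(response.data)}")
    return [item["embedding"] for item in response.data]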

input_chunks = [chunk.strip() for chunk in input_chunks if chunk.strip()]
embeddings = self.model.generate_text_embeddings(input_texts=input_chunks)
data[self.config.name] = {
    "embeddings": embeddings,
A reviewer (Contributor) commented on the snippet above:

My understanding is that these 3 fields are added as a JSON in a single column, correct? Considering that embeddings is list[list[float]] (?), is that an issue? Could it be that the JSON is added as a string and the embeddings are encoded sub-optimally, truncated, etc.?

@nabinchha (Contributor, Author) replied:

My understanding is that these 3 fields are added as a JSON in a single column, correct?

Yes, that's correct. I'll double-check what happens when we serialize these as partial results and report back. I think they were serialized correctly, without truncation, when I ran some tests.
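
For reference, a quick round-trip check along those lines (a sketch, not the actual test):

import json
import random

# Embeddings stored as JSON in a single cell should survive a
# serialize/deserialize round trip without float truncation.
embeddings = [[random.random() for _ in range(768)] for _ in range(3)]
cell = json.dumps({"embeddings": embeddings})
restored = json.loads(cell)
assert restored["embeddings"] == embeddings  # json round-trips Python floats exactly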

andreatgretel previously approved these changes Dec 2, 2025.

@eric-tramel (Contributor) commented:

Specifying Model Type

In the current example, the nature of the model must be inferred from the provided inference-parameter type. E.g., the only way to know that the following model is, in fact, an embedding model is to see that the user provided EmbeddingInferenceParameters.

    ModelConfig(
        alias="nvidia-embedder",
        model="nvdev/nvidia/llama-3.2-nv-embedqa-1b-v2",
        provider="nvidia",
        inference_parameters=EmbeddingInferenceParameters(
            extra_body={"input_type": "query"},
        ),
    )

Another natural thing to do would be to create subclasses of ModelConfig, so you can reference models directly by their config type for certain actions (like selecting which inference endpoints they use, etc.):

    EmbeddingModelConfig(
        alias="nvidia-embedder",
        model="nvdev/nvidia/llama-3.2-nv-embedqa-1b-v2",
        provider="nvidia",
        inference_parameters=EmbeddingInferenceParameters(
            extra_body={"input_type": "query"},
        ),
    )

This would also allow the flexibility later to specify inference_parameters as a raw dict and still know which type to cast it to under the hood for input verification.
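
A minimal sketch of that idea (assuming pydantic-style configs and that ModelConfig declares inference_parameters; only the validator is new here):

from pydantic import field_validator

# Hypothetical sketch: the config subtype itself pins down which
# inference-parameter class a raw dict should be cast to.
class EmbeddingModelConfig(ModelConfig):
    @field_validator("inference_parameters", mode="before")
    @classmethod
    def _cast_raw_dict(cls, value):
        if isinstance(value, dict):
            return EmbeddingInferenceParameters(**value)
        return value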

Chunking As A Separate Action

Presently the chunking pattern is specified on the embedding column itself, but joining chunking and embedding into a single step like this seems a bit cumbersome.

config_builder.add_column(
    EmbeddingColumnConfig(
        name="embedding_nvidia",
        model_alias="nvidia-embedder",
        target_column="questions",
        chunk_pattern="\n+"
    )
)

Another option might be to create a separate ChunkTextColumnConfig to specify the chunking operation itself (maybe there are other, more complex flavors which require their own configuration kwargs). Then one can chunk once, and perhaps apply multiple different embedders to the same "chunked text column". E.g.,

config_builder.add_column(
    ChunkTextColumnConfig(
        name="questions_chunked",
        target_column="question",
        chunk_style="newlines"
    )
)

config_builder.add_column(
    EmbeddingColumnConfig(
        name="embedding_A",
        model_alias="model_A",
        target_column="questions_chunked",
    )
)

config_builder.add_column(
    EmbeddingColumnConfig(
        name="embedding_B",
        model_alias="model_B",
        target_column="questions_chunked",
    )
)

@eric-tramel (Contributor) commented:

Another advantage of the separate chunking pattern is that we can use it in other contexts, e.g. if we want to extract a random chunk from a document.

@nabinchha (Contributor, Author) replied:

@eric-tramel thanks for the feedback!

Specifying Model Type

IMO, ModelConfig is a simple enough data structure to contain all information regarding the model. Introducing additional layers on top of it, for example EmbeddingModelConfig and ImageGenerationModelConfig, seems a little complicated, especially because we would still need classes for the generation-type-specific inference parameters, such as EmbeddingInferenceParameters, anyway.

I initially added ModelConfig.generation_type to help distinguish between the types here, which we only need for health checks, but removed it because the type of the inference parameters answers the same question.

Chunking As A Separate Action

This is a good callout... something I was playing around with. The main goal of this generator is to allow generation of multiple embeddings per cell. Splitting the content of the target column via a regex pattern within embedding generation seemed generic enough to me, but I totally see the simplicity of decoupling the two.

In the example you provided, after we use a different generator to do chunking, we still need a way to tell the embedding generator how to discover these different chunks... unless we explicitly say that EmbeddingColumnConfig expects target columns to be a list of strings or a string. Is that what you had in mind?

@eric-tramel (Contributor) replied:

IMO, ModelConfig is a simple enough data structure to contain all information regarding the model. Introducing additional layers to it, for example, EmbeddingModelConfig, ImageGenerationModelConfig seems a little complicated especially because despite that we still need classes to contain generation type specific inference parameters such as EmbeddingInferenceParameters, etc.

I initially added ModelConfig.generation_type to help distinguish between the types here, which we only need to perform health checks, but removed it because the type of inference parameters helped answer the same question.

Gotcha -- but then how is this going to be handled at the CLI when inputting model config parameters? Will the user need to specify the kind of model at that point, too?

$ data-designer config models
...
What model type is this?
    -> llm
       vlm
       embedder

In the example you provided, after we use a different generator to do chunking, we still need a way to tell the embedding generator how to discover these different chunks.... unless we explicitly say that EmbeddingColumnConfig expects target columns to be a list of strings or a string. Is that what you had in mind?

Yep, in this case an EmbeddingColumn task could operate on either a string input or a list of strings (which can be represented by some internal data structure/type matching it). In the case of a list of strings, you get the embeddings of the list of strings; otherwise it's the embedding of the single string.

It would be interesting in the future to consider the option of some kind of explode feature, like in pandas/polars. This would be entirely optional, but could allow a user to flatten the nesting of their dataset if nesting isn't what they want. For instance, the below pattern would allow one to create a dataset of "all embedding chunks" from a set of source documents without needing to unpack or fiddle with nesting structures themselves.

# Start with N documents in the dataset, now chunk.
config_builder.add_column(
    ChunkTextColumnConfig(
        name="document_chunked",
        target_column="document",
        chunk_params={"style": "contiguous", "max_chars": 4096},
    )
)

"""
Now, assume M chunks / document. We now have N rows where each row contains
document_chunked_row_0 = [chunk_0_0, ..., chunk_0_M]
...
document_chunked_row_N = [chunk_N_0, ..., chunk_N_M]

Next, let's say we want to operate on each chunk independent for the rest of the workflow.
"""

config_builder.explode(name="document_chunked")

"""
Now, we have N*M rows in our dataset generated after document_chunked created.

document_chunked_row_0 = chunk_0_0
document_chunked_row_1 = chunk_0_1
...
document_chunked_row_M = chunk_0_M
...
document_chunked_row_NM = chunk_N_M

And perhaps we want to do our embedding now
"""

config_builder.add_column(
    EmbeddingColumnConfig(
        name="embedding_A",
        model_alias="model_A",
        target_column="questions_chunked",
    )
)

@nabinchha (Contributor, Author) commented Dec 3, 2025:

Will the user need to specify the kind of model at that point, too?

Right, something like that. I hadn't actually thought about the CLI, so thanks for raising it! It might make sense to hang generation_type off of the ModelConfig introduced in this PR to make it more explicit:

from enum import Enum
from typing import Optional

from pydantic import BaseModel, Field

class GenerationType(str, Enum):
    CHAT_COMPLETION = "chat-completion"
    EMBEDDING = "embedding"
    IMAGE_GENERATION = "image-generation"

class ModelConfig(BaseModel):  # assuming a pydantic BaseModel, per the Field usage
    alias: str
    model: str
    generation_type: Optional[GenerationType] = GenerationType.CHAT_COMPLETION
    inference_parameters: InferenceParametersT = Field(default_factory=CompletionInferenceParameters)
    provider: Optional[str] = None

    # Validate that the type of inference_parameters matches generation_type

In the CLI, we'll just prompt the user to choose among the three and tailor the inference parameter setup based on that choice. WDYT?
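
A minimal sketch of that validation (hypothetical; the actual check in the PR may differ):

from pydantic import model_validator

_EXPECTED_PARAMS = {
    GenerationType.CHAT_COMPLETION: CompletionInferenceParameters,
    GenerationType.EMBEDDING: EmbeddingInferenceParameters,
}

class ModelConfig(BaseModel):
    # ... fields as above ...

    @model_validator(mode="after")
    def _check_inference_parameters(self):
        expected = _EXPECTED_PARAMS.get(self.generation_type)
        if expected is not None and not isinstance(self.inference_parameters, expected):
            raise ValueError(
                f"{type(self.inference_parameters).__name__} does not match "
                f"generation_type={self.generation_type.value!r}"
            )
        return self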

Updated in c6c29d4

@nabinchha (Contributor, Author) commented Dec 3, 2025:

Yep, in this case an EmbeddingColumn task could operate on either a string input or a list of strings (which can be represented by some internal data structure/type matching it). In the case of a list of strings, you get the embeddings of the list of strings, otherwise its the embedding of the single string.

It would be interesting in the future to consider the option of some kind of explode feature, like in pandas/polars. This would be entirely optional, but can allow a user to flatten the nesting of their dataset if that's not what they want. For instance, the below pattern would allow one to create a dataset of "all embedding chunks" from a set of source documents without needing to unpack or fiddle with nesting structures themselves.

The need to support generating multiple embeddings per row in a single generator exists exactly because, within the NDDL workflow, we don't yet have a way to explode and reduce rows. Until that becomes a reality, keeping the embedding generator simpler (operating on strings or lists of strings) without worrying about chunking should suffice and is generic enough. We can add chunking support in a different PR. Let me update this PR and incorporate your suggestions!

I removed the chunking param/logic in 06a724b. The embedding generator now expects the column it targets to contain a string or a stringified JSON list of strings.
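
For illustration, that target-column handling could look like this (a sketch; the helper name is made up):

import json

def _resolve_chunks(cell: str) -> list[str]:
    # Accept either a plain string or a stringified JSON list of
    # strings; anything else falls back to a single-chunk list.
    try:
        parsed = json.loads(cell)
    except (TypeError, ValueError):
        return [cell]
    if isinstance(parsed, list) and all(isinstance(c, str) for c in parsed):
        return parsed
    return [cell]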

@nabinchha (Contributor, Author) commented:

Closing this PR in favor of #106
